Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ESQL: ST_EXTENT_AGG optimize envelope extraction from doc-values for cartesian_shape #118802

Merged
merged 31 commits into from
Dec 20, 2024

Conversation

GalLalouche
Copy link
Contributor

@GalLalouche GalLalouche commented Dec 16, 2024

This PR optimizes ST_EXTENT_AGG when performed on Cartesian shapes.

When we cartesian shapes to Lucene, we encode additional information in the binary, including the extent, so we can read the extent directly from the binary encoding instead of (re-)computing it per shape, and replace the (possibly very complicated) shape with a rectangle. At the moment, this is only done for Cartesian shapes, since for Geo shapes, we need to take dateline wrapping into account, which means it can't be directly encoded as a rectangle. We will deal with geo shapes in a future PR.

@GalLalouche GalLalouche force-pushed the optimization/st_extent branch from cd5b519 to a519779 Compare December 17, 2024 12:12
@GalLalouche GalLalouche marked this pull request as ready for review December 17, 2024 12:25
@elasticsearchmachine elasticsearchmachine added the needs:triage Requires assignment of a team area label label Dec 17, 2024
@GalLalouche GalLalouche added >enhancement :Analytics/Geo Indexing, search aggregations of geo points and shapes Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) auto-backport Automatically create backport pull requests when merged :Analytics/ES|QL AKA ESQL and removed needs:triage Requires assignment of a team area label labels Dec 17, 2024
@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/es-analytical-engine (Team:Analytics)

@elasticsearchmachine
Copy link
Collaborator

Hi @GalLalouche, I've created a changelog YAML for you.

@craigtaverner craigtaverner changed the title ESQL: ST_EXTENT_AGG binary extent optimization ESQL: ST_EXTENT_AGG optimize envelope extraction from doc-values for cartesian_shape Dec 18, 2024
Copy link
Contributor

@craigtaverner craigtaverner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think there are still a few things to fix.

return columnReader ? DOC_VALUES : NONE;
}

public boolean isColumnReader() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This appear to be unused. I think we can remove this boolean from the enum entirely.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Later I did see some code that could have used this, but at that point I also wondered if it would be better to just stick with the original boolean.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Must be a left-over. Removed.


private void testBoundsBlockLoaderAux(
CoordinateEncoder encoder,
Supplier<Geometry> generator,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Parameter not used

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nope. Still unused.

Function<String, ShapeIndexer> indexerFactory,
Function<Geometry, Optional<Rectangle>> visitor
) throws IOException {
var geometries = IntStream.range(0, 20).mapToObj(i -> ShapeTestUtils.randomGeometryWithoutCircle(0, false)).toList();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps this should call generator?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep :)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It still does not call generator.

iw.addDocument(doc);
}
}
var indices = IntStream.range(0, geometries.size() / 2).map(x -> x * 2).toArray();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the reason for this only looking at every second geometry?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Verifying that the block loader extracts data from the correct docs.

* \_AggregateExec[[],[SPATIALSTEXTENT(location{f}#48,true[BOOLEAN]) AS extent],INITIAL,[
* minNegX{r}#73, minPosX{r}#74, maxNegX{rb#75, maxPosX{r}#76, maxY{r}#77, minY{r}#78],21]
* \_FieldExtractExec[location{f}#48][location{f}#48]
* \_EsQueryExec[airports], indexMode[standard], query[{"exists":{"field":"location","boost":1.0}}][
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How can the EsQueryExec refer to the airports index when that was not in the query? This comment seems incorrect. Perhaps copied from another test? Either delete the comment, or generate the correct one.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice catch!

@@ -6912,12 +7066,23 @@ private EsQueryExec assertChildIsGeoPointExtract(UnaryExec parent, boolean useDo
}

private EsQueryExec assertChildIsExtractedAsDocValues(UnaryExec parent, boolean useDocValues, DataType dataType) {
// TODO(gal) why is this OK To vacuously true?
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What does this question mean? The line after it asserts that the child node is a FieldExtractExec, and that assertion is reflected in the method name above too.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If attributesToExtract is empty, this test would pass, which is probably not what we want if useDocValues is true. I'll add a check that it isn't empty.

Copy link
Contributor

@craigtaverner craigtaverner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There were no tests that cover the case where shapes were ingested without doc-values. If these existed, they would have failed since the current optimization triggers even if there are no doc-values. There are a few things to do to cover this scenario:

  • Update PhysicalPlanOptimizerTests to have the airportCityBoundary index with and without doc-values and make assertions that the plan is only optimized to load extents from doc values if they are there.
  • Add csv-spec tests that have cartesian_shapes with and without doc-values, and assert that the queries run with the same results in both.
  • Add SpatialPushDownCartesianShapeIT tests that test the same results with and without doc-values and filter pushdown

For this PR it is sufficient to have either the csv-spec, or the SpatialPushDownCartesianShapeIT tests asserting the same results. Having both is even better.

@GalLalouche GalLalouche force-pushed the optimization/st_extent branch from f823bc5 to d62e37d Compare December 19, 2024 18:15
NONE;

public static FieldExtractPreference forColumnReader(boolean columnReader) {
return columnReader ? DOC_VALUES : NONE;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This does not really make sense to me. Both DOC_VALUES and EXTRACT_SPATIAL_BOUNDS will require a column reader. I suspect the situation is that the places this is used actually never test the EXTRACT_SPATIAL_BOUNDS case, so this does not matter. But once we add tests for that, it will.

Copy link
Contributor Author

@GalLalouche GalLalouche Dec 19, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, this is for backward compatibility. I'll just inline it and add a comment so it's clearer.

Copy link
Contributor

@craigtaverner craigtaverner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With the latest tests and the grouping bug fixed, I think this looks good to go!

@GalLalouche GalLalouche merged commit ef6372a into elastic:main Dec 20, 2024
16 checks passed
@elasticsearchmachine
Copy link
Collaborator

💚 Backport successful

Status Branch Result
8.x

GalLalouche added a commit to GalLalouche/elasticsearch that referenced this pull request Dec 20, 2024
…Cartesian_shape (elastic#118802)

When we cartesian shapes to Lucene, we encode additional information in the binary, including the extent, so we can read the extent directly from the binary encoding instead of (re-)computing it per shape, and replace the (possibly very complicated) shape with a rectangle. At the moment, this is only done for Cartesian shapes, since for Geo shapes, we need to take dateline wrapping into account, which means it can't be directly encoded as a rectangle. We will deal with geo shapes in a future PR.
GalLalouche added a commit to GalLalouche/elasticsearch that referenced this pull request Dec 22, 2024
…Cartesian_shape (elastic#118802)

When we cartesian shapes to Lucene, we encode additional information in the binary, including the extent, so we can read the extent directly from the binary encoding instead of (re-)computing it per shape, and replace the (possibly very complicated) shape with a rectangle. At the moment, this is only done for Cartesian shapes, since for Geo shapes, we need to take dateline wrapping into account, which means it can't be directly encoded as a rectangle. We will deal with geo shapes in a future PR.
elasticsearchmachine pushed a commit that referenced this pull request Dec 22, 2024
…s for Cartesian_shape (#118802) (#119187)

* ESQL: ST_EXTENT_AGG optimize envelope extraction from doc-values for Cartesian_shape (#118802)

When we cartesian shapes to Lucene, we encode additional information in the binary, including the extent, so we can read the extent directly from the binary encoding instead of (re-)computing it per shape, and replace the (possibly very complicated) shape with a rectangle. At the moment, this is only done for Cartesian shapes, since for Geo shapes, we need to take dateline wrapping into account, which means it can't be directly encoded as a rectangle. We will deal with geo shapes in a future PR.

* Use oldstyle switch
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
:Analytics/ES|QL AKA ESQL :Analytics/Geo Indexing, search aggregations of geo points and shapes auto-backport Automatically create backport pull requests when merged >enhancement Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) v8.18.0 v9.0.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants